PE/PPE mutations in the transmission of Mycobacterium tuberculosis in China revealed by whole genome sequencing

Objective This study aims to examine the impact of PE/PPE gene mutations on the transmission of Mycobacterium tuberculosis (M. tuberculosis) in China. Methods We collected the whole genome sequencing (WGS) data of 3202 M. tuberculosis isolates in China from 2007 to 2018 and investigated the clustering of strains from different lineages. To evaluate the potential role of PE/PPE gene mutations in the dissemination of the pathogen, we employed homoplastic analysis to detect homoplastic single nucleotide polymorphisms (SNPs) within these gene regions. Subsequently, logistic regression analysis was conducted to analyze the statistical association. Results Based on nationwide M. tuberculosis WGS data, it has been observed that the majority of the M. tuberculosis burden in China is caused by lineage 2 strains, followed by lineage 4. Lineage 2 exhibited a higher number of transmission clusters, totaling 446 clusters, of which 77 were cross-regional clusters. Conversely, there were only 52 transmission clusters in lineage 4, of which 9 were cross-regional clusters. In the analysis of lineage 2 isolates, regression results showed that 4 specific gene mutations, PE4 (position 190,394; c.46G > A), PE_PGRS10 (839,194; c.744 A > G), PE16 (1,607,005; c.620T > G) and PE_PGRS44 (2,921,883; c.333 C > A), were significantly associated with the transmission of M. tuberculosis. Mutations of PE_PGRS10 (839,334; c.884 A > G), PE_PGRS11 (847,613; c.1455G > C), PE_PGRS47 (3,054,724; c.811 A > G) and PPE66 (4,189,930; c.303G > C) exhibited significant associations with the cross-regional clusters. A total of 13 mutation positions showed a positive correlation with clustering size, indicating a positive association. For lineage 4 strains, no mutations were found to enhance transmission, but 2 mutation sites were identified as risk factors for cross-regional clusters. These included PE_PGRS4 (338,100; c.974 A > G) and PPE13 (976,897; c.1307 A > C). Conclusion Our results indicate that some PE/PPE gene mutations can increase the risk of M. tuberculosis transmission, which might provide a basis for controlling the spread of tuberculosis. Supplementary Information The online version contains supplementary material available at 10.1186/s12866-024-03352-y.


Background
Tuberculosis (TB) continues to be one of the most prevalent and deadly communicable disease that is a major global health challenge [1].The Covid-19 pandemic has disrupted TB services which left many TB patients undiagnosed and untreated, leading to an increase in TB deaths and transmission [2,3].The evolution and spread of tuberculosis threaten to undermine the success of tuberculosis treatment and control programs [4].In order to effectively combat TB, it is crucial to have a comprehensive understanding of TB's transmission mechanisms.
One hallmark of the Mycobacterium tuberculosis (M.tuberculosis) genome is the presence of the multigenic PE/PPE family of proteins, accounting for about 10% of the coding region of the genome [5].The standard H37Rv has 99 PE genes and 69 PPE genes, which characterized by conserved N-terminal prolineglutamate (PE) and proline-proline-glutamate (PPE) motifs [6].Based on the high polymorphism of C-terminal amino acid sequence, the PE family can be divided into PE_PGRS (polymorphic GC-rich sequences) and PE (with no distinctive features) genes, and the PPE family includes PPE_MPTR (major polymorphic tandem repeats), PPE_SVP (with a GxxSVPxxW motif ), PPE_PPW(with a PxxPxxW motif ) and PPE genes with no distinctive features [5,[7][8][9].The unique sequences of PE/PPE proteins might underlie their specific physiological roles during M. tuberculosis infection.
Many PE/PPE proteins have been shown to play an important role in antigenicity, immune-modulation and virulence in M. tuberculosis [10][11][12].For example, PPE68 and PE35 were identified as required for M. tuberculosis virulence [13,14].Cell necrosis is associated with the spread and virulence of M. tuberculosis because it leads to the release and dissemination of the tuberculin pathogen [15].This function has been reported for PE_PGRS33 [16]and PPE27 [17].PPE39 [18], PE_PGRS5 [19] and PE_ PGRS17 [20] were shown to play roles in host cell interaction and immune regulation.Alternatively, the highly cellular immune response suggests that some PE/PPE proteins may be better diagnostic and vaccine candidates [21].Some genes were found to contribute to enhanced transmission of M. tuberculosis, such as mutation in the ESX-5 type VII secreted protein EsxW [22] and mutation in the lldD2 promoter [23].The role of PE/PPE genes in transmission, albeit less studied, has shown that mutations in PPE54, for instance, contribute to the enhanced spread of the disease in Malawi [24], highlighting their potential significance in the epidemiological dynamics.Here, we have used whole genome sequence data from 3202 Chinese isolates and detected homoplastic single nucleotide polymorphisms (SNPs) in PE/PPE gene region to assess the impact of these homoplastic SNPs on the spread of M. tuberculosis.The result may provide new insights into the impact of the PE/PPE gene mutations on the spread of M. tuberculosis in China.

Sample collection
The M. tuberculosis strains analyzed in this study included two sets of samples (Supplementary Table 1).
(1) Newly sequenced whole genome dataset.We included 1,550 culture-confirmed TB samples with drug susceptibility test (DST) results reported to Shandong Tuberculosis Surveillance System during 2013-2017, and the genome sequence data were deposited in the National Center for Biotechnology Information (NCBI) BioProject database (Accession number is PRJNA1002108).( 2) Countrywide collection of publicly available clinical isolates of M. tuberculosis.We downloaded the WGS data of 1,755 isolates from the European Nucleotide Archive repository.The genomes were sampled from 2007 to 2018, and the geographic distribution covered 30 of the 34 provincial regions of China.Individual patient identifiers were removed before data analysis and reporting.An informed consent waiver and ethical approval were obtained from the Ethics Committee of Shandong Provincial Hospital affiliated to Shandong First Medical University.

Whole-genome sequencing
The genome of the 1468 Shandong isolates was sequenced using Illumina HiSeq 4000.Quality assessment of all acquired reads was performed with FastQC v.0.11.9 (version 0.11.7), and 1447 samples passed quality control [25].Low-quality raw reads from paired-end sequencing were discarded.Reads were then aligned to the H37Rv (NC_000962) reference genome using BWA-MEM (version 0.7.17-r1188) [26].Duplicate reads and clipped alignments were removed with Sam tools markdup (version 1.15) and Samclip (version 0.4.0)[27,28], and only samples with a coverage rate of 98% or higher and a minimum depth of at least 20 were included.The filtered vcf file was annotated with snpEFF (version 4.3t) to obtain the final sample single nucleotide polymorphisms (SNPs) [29].

Homoplastic SNPs identifcation
We used snippy-core (version 4.6.0) to obtain SNPs form entire 3202 isolates.To assign lineages, we analyzed M. tuberculosis WGS data using TBProfiler (version 4.3.0)[30,31].The IQ-TREE (v1.6.12)model "JC + I + G4" used 1000 ultrafast bootstrap replicates and treetime (v0.9.0) to construct and date maximum-likelihood (ML) phylogenetic tree [32].The occurrence of homoplastic mutations can be attributed to convergent evolution driven by selective pressures.Homoplasy analysis was performed employing the HomoplasyFinder software, in adherence to established protocols and methodologies.The SNPs were considered as homoplastic if they occurred independently within different transmission clades and did not form a monophyletic clade based on the provided phylogenetic tree and nucleotide alignment consistency index [33,34].Homoplastic SNPs located in the PE/PPE gene region, with a minor allele frequency (MAF) > 0.005 were included for further analysis.

Transmissibility analysis
Clusters were defined as strains with a genetic distance of 10 SNPs or fewer, indicating recent transmission [22].These clusters were categorized as cross-regional or regional, depending on whether they included strains from two or more of China's seven geographic regions (Northwest, Northern, Northeast, Central, Eastern, Southern, and Southwest China) [35].Furthermore, transmission clusters were subdivided into small clusters (2 isolates), medium-sized clusters (3-6 isolates), and large clusters (more than 6 isolates) based on the number of isolates within each cluster [36,37].

Statistical analysis
To streamline the analysis, homoplastic SNPs with a MAF < 0.005 (approximately 15. isolates) in the PE/PPE gene region were excluded from the analysis.Between clustered and non-clustered, as well as cross-regional and regional clusters, we employed univariate regression analysis and included sites with P values < 0.2 in the subsequent multivariate regression analysis [38].To analyze the effect of PE/PPE gene mutations on cluster size, we conducted a Spearman's rank correlation analysis.The statistical analysis was performed using IBM SPSS Statistics (version 26.0).All reported statistical tests were two-sided, and P values < 0.05 were considered statistically significant.

Sample structure
Of all domestic 3202 strains, 2745 isolates (85.7%) belonged to lineage 2 (94.4% belonged to sublineage 2.2.1), 443 (13.8%) isolates belonged to lineage 4, only 14 isolates belonged to other lineages (lineage 1 and lineage 3).We constructed maximum likelihood phylogenetic trees for lineage 2 and lineage 4 M. tuberculosis isolates, respectively (Fig. 1a and b).The results of strain clustering showed that 1462 strains in lineage 2 were grouped into 446 transmission groups, which were consisted of 2 to 107 isolates.In lineage 4, a total of 52 clusters contained 132 isolates, ranging in size from 2 to 9 isolates (Table 1).It should be noted that lineage 2 strains had a considerably larger proportion of strains in transmission clusters than lineage 4 strains (53.3% vs. 29.8%,P < 0.001).

The effect of PE/PPE gene mutations on transmission of L2 strains
A comprehensive analysis revealed 1,141 homoplastic SNPs in lineage 2 strains, as detailed in Supplementary    2).The 382 strains belonging to lineage 2 formed 77 crossregional clusters, ranging from 2 to 6 geographic regions.Among the 7 geographic regions, Northern China (31.9%) and Central China (27.3%) exhibited the highest proportion of these cross-regional clusters, followed by and Southwest China (23.1%) and Northwest China (19.5%).In the univariate analysis, 57 SNPs exhibited statistically significant differences between cross-regional and regional clusters (P < 0.05).Subsequent multivariate logistic regression analysis identified 10 mutations as influencing factors (P < 0.05), with 4 mutation positions recognized as risk factors for cross-regional clusters, including PE_PGRS10 (839,334; c. 884 3).

The effect of PE/PPE gene mutations on transmission of L4 strains
A total of 205 homoplastic SNPs were detected in lineage 4 strains, as presented in Supplementary Table 4 for reference.After excluding homoplastic SNPs with a MAF below 0.005, 74 SNPs in the PE/PPE gene region were selected for in-depth examination.A significant difference in 6 PE/PPE gene mutation positions was detected between clustered and non-clustered strains, as per the single-factor analysis (P < 0.05).To further investigate these associations, a multivariate logistic regression analysis was conducted, focusing on the 22 loci with P-values less than 0.2 from the initial univariate analysis.However, no mutations were observed in lineage 4 strains that seemed to facilitate transmission, as displayed in Table 3.
Furthermore, 25 lineage 4 strains grouped into 9 crossregional clusters, with strains in each cluster spanning two different geographic regions.After conducting univariate analysis, 9 mutation sites were selected for a multivariate regression analysis, revealing that 4 positions were significantly associated with cross-regional clusters (P < 0.05).PE_PGRS4 (338,100; c.974A > G; OR, 6.090; 95% CI, 1.702-21.793)and PPE13 (976,897; c.1307A > C; OR, 3.505; 95% CI,1.103-11.132)considered as risk factors for cross-regional transmission of strains (Supplementary Table 5).Due to the low prevalence of lineage 4 strains in China and the relatively small sample size, we did not further analyze the transmission cluster size of lineage 4 strains.

Discussion
We analyzed 3202 domestic isolates (including the 1447 Shandong isolates and 1755 publicly available isolates) in this study.The genomic analysis of M. tuberculosis across the country reveals that lineage 2 is the predominant strain, contributing significantly to the tuberculosis burden in China, followed by lineage 4.Moreover, lineage 2 strains had a considerably larger proportion of strains in transmission clusters than lineage 4 strains.In subsequent analyses, we identified 4 PE/PPE gene mutations associated with the spread of lineage 2 strains, but no mutations were found to enhance the transmission of lineage 4 strains.
Homoplastic mutations are mutations independently occurring in different clades of an organism.Homoplastic changes may be a result of convergence evolution due to selective pressures [39].Previous reports have shown that homoplastic SNPs were present in all functional categories of genes, with PE/PPE gene family having the highest ratio of homoplastic SNPs compared to the total SNPs identified in the same functional category [34], but the relationship between these SNPs within the PE/PPE gene region and strain transmission has not been described.In our study, the results of homoplasy analysis of strains showed that 1,141 homoplastic SNPs were identified in strains of lineage 2 and 78 homoplastic SNPs were confirmed in strains of lineage 4, respectively.
The PE/PPE genes are especially abundant in pathogenic mycobacteria, suggesting that they play a major role in mycobacterial survival and pathogenesis, although the precise function of these proteins is largely unknown [6,40].Based on our findings, we observed a missense  (Rv0160c), and another missense mutation (c.620T > G, p.Leu207Arg) at position 1,607,005 of PE16 (Rv1430), which have been associated with increased risk of transmission of M. tuberculosis within lineage 2. Despite the lingering uncertainties surrounding the exact function of PE4, recent study has shed light on its role in enhancing mycobacterial survival within macrophages [41].PE16, a member of the serine hydrolase superfamily with esterase activity, is particularly notable for its ability to hydrolyze short-to medium-chain fatty acid esters [42].Both PE4 and PE16 play pivotal roles in the progression of mycobacterial infections.To fully comprehend the mechanisms through which these mutations facilitate transmission, further in-depth research is imperative.A synonymous mutation at position 839,194 of PE_PGRS10 (Rv0747) (c.744A > G, p.Thr248Thr) has been found to   affect the transmission of lineage 2 isolates.Mazandu and Mulder [43] have predicted that PE_PGRS10 is involved in lipid metabolism, although this has yet to be experimentally verified, an important feature of mycobacterial pathogenicity.This synonym mutation can affect the spread of M. tuberculosis, indicating that the synonym mutations of PE/PPE genes are not all neutral mutations, which is consistent with the previous research results that the synonym mutations of yeast genes are mostly strong non-neutral mutations [44].
The protein functions of PE_PGRS are unique to mycobacteria and are secreted or cell surface associated [45].This suggests that they could be involved in mediating the interaction between the macrophages and the bacteria [46][47][48].PE_PGRS proteins were translocated through the plasmatic membrane in an ESX5-dependent mechanism [49,50], once in the outer membrane PE_ PGRS proteins may "float" on the micromembrane outer leaflet and may possibly be released to exert their activity.The cell wall-anchored/secretory protein PE_PGRS 11 (Rv0754) exhibits a significant interaction with TLR2, driving the maturation and activation of human dendritic cells (DCs), thereby enhancing their capacity to stimulate CD4 + T cells [20].Analysis of the effects of deletion or over-expression of PE_PGRS47 (Rv2741) implicated this protein in the inhibition of autophagy in infected host phagocytes.As a functionally significant and nonredundant bacterial component, PE_PGRS47 contributes to the modulation of both innate and adaptive immune responses.This finding implicates PE_PGRS47 as a potential target for enhancing antigen presentation and fostering protective immunity during vaccination or infection [51].A missense mutation (c.1455G > C, p.Glu485Asp) at position 847,613 of PE_PGRS11, and another missense mutation (c.811A > G, p.Ser271Gly) at position 3,054,724 of PE_PGRS47 have been found to promote the cross-regional spread of lineage 2 isolates.
A previous study of the influence of genomic variants on M. tuberculosis transmission in Malawi showed significant convergent evolution in a mutation in PPE54, which associated with transmission [24].Notably, 68.7% of the strains in this study belonged to lineage 4, and only 3.8% belonged to lineage 2. However, in our study, we did not find that the mutation of PE54 could promote the spread of lineage 4 strains in China.One possible explanation for this could be the bacteriological diversity of the lineage 4 sublineages, which show complex population structure with 21 sublineages and 15 internal groups [35,52].Another possible reason for this difference is the limited sample size of lineage 4, which might have been inadequate to identify less common lineage 4 strains or associated mutations.A larger dataset might potentially yield more precise findings, suggesting that the current sample size might have introduced a degree of inaccuracy.
Our study is the first to explore the association of PE/ PPE gene mutations with M. tuberculosis transmission in China.Nevertheless, we could illustrate some limitations.The distribution of M. tuberculosis isolates in the current dataset may not accurately reflect the prevalence of TB in some areas, but our data have a large sample size and broad geographical distribution which allowed us to analyze how M. tuberculosis spread across the whole country.Additionally, due to the low prevalence of lineage 4 in China and the relatively small sample size, we did not subdivide the transmission cluster sizes of lineage 4 strains.The higher sensitivity of lineage specific analysis may require a larger sample size to be achieved.

Conclusion
In summary, our work highlights two main M. tuberculosis lineages of transmission in China, as well as some PE/ PPE gene mutations can increase the risk of MTB transmission in lineages 2 and 4, respectively, providing valuable insights for the treatment of TB.We believe that the PE/PPE family will remain a highly active area of research with various exciting features yet to be discovered.

Fig. 1
Fig. 1 (a) Phylogenetic tree of 2745 Chinese M. tuberculosis strains in lineage 2. (b) Phylogenetic tree of 443 Chinese M. tuberculosis strains in lineage 4

Fig. 2
Fig. 2 Correlation analysis of PE/PPE gene mutation positions and cluster size

Table 2 .
After excluding SNPs with a MAF below 0.005, a total of 140 homoplastic SNPs from the PE/PPE gene region were selected for further analysis.

Table 2
Analysis of the PE/PPE gene mutations in clustering and non-clustering of lineage 2 mutation (c.46G > A, p.Ala16Thr) at position 190,394 of PE4.
SNP, single nucleotide polymorphisms; MAF, minor allele frequency; OR, Odds ratio; CI, confidence interval; -means there is no result in statistical software or the result was too large and nonsense

Table 3
Analysis of the PE/PPE gene mutations in clustering and non-clustering of lineage 4 SNP, single nucleotide polymorphisms; MAF, minor allele frequency; OR, Odds ratio; CI, confidence interval;